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Abstract 

Background: Abnormal epigenetic marking is well documented in gene promoters of cancer cells, but the study 
of distal regulatory siteshas lagged behind.We performed a systematic analysis of DNA methylation sites connected 
with gene expression profilesacross normal and cancerous human genomes. 

Results: Utilizing methylation and expression data in 58 cell types, we developed a model for methylation- 
expression relationships in gene promoters and extrapolated it to the genome. We mapped numerous sites at 
which DNA methylation was associated with expression of distal genes. These sites bind transcription factors in a 
methylation-dependent manner, and carry the chromatin marks of a particular class of transcriptional enhancers. In 
contrast to the traditional model of one enhancer site per cell type, we found that single enhancer sites may 
define gradients of expression levels across many different cell types. Strikingly, the identified sites were drastically 
altered in cancers: hypomethylated enhancer sites associated with upregulation of cancer-related genes and 
hypermethylated sites with downregulation. Moreover, the association between enhancer methylation and gene 
deregulation in cancerwas significantly stronger than the association of promoter methylationwith gene 
deregulation. 

Conclusions: Methylation of distal regulatory sites is closely related to gene expression levels across the genome. 
Single enhancers may modulate ranges of cell-specific transcription levels, from constantlyopen promoters. In 
contrast to the remote relationships between promoter methylation and gene dysregulation in cancer, altered 
methylation of enhancer sites is closely related to gene expression profiles of transformed cells. 

Keywords: Cancer, DNA methylation, distal control elements, enhancers, epigenomics, gene-enhancer pairing, 
gene regulation, machine-learning 
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Background 

DNA methylation is a key determinant of regulatory 
chromatin complexes transmitted through cell divisions. 
Relationships between DNA methylation and gene 
expression levels were first recognized at gene promo- 
ters and CpG islands [1], but were recently observed 
across the genome [2-6]. In gene promoters, DNA 
methylation mediates silencing by altering the binding 
of transcription modulators to their DNA targets [7], 
but the cause and function of expression-associated 
methylation away from gene promoters has been elusive. 
The relationships between aberrant DNA methylation 
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and the altered expression profiles of cancer cells are 
also not well understood: the predominant pattern in 
gene promoters is denovo methylation of polycomb- 
repressed genes [8-11]. Since these genes are already 
inactive in the normal tissue and generally remain inac- 
tive in the cancer [12,13], it is hard to establish direct 
contribution to the cancer process [14]. Away from gene 
promoters, perturbed methylation has been associated 
with reduced genomic stability [15,16] and global silen- 
cing of large chromatin domains [17,18], but the effect 
on transcription of particular genes remains unknown. 

Transcriptional enhancers support tissue-specific 
expression profiles through physical interactions with 
gene promoters. Based on analyses of chromatin struc- 
tures, hundreds ofthousands of transcriptional enhancers 
were predicted in mammalian genomes (reviewed in 
[19]). These sites bind chromatin-modulating factors, 
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interact with distal promoters through DNA loops, and 
demonstrate a unique pattern of DNA methylation [20]. 
In some particular examples, enhancer sites turned out 
to be less methylated in cells expressing the controlled 
genes than in non-expressing cells [21-24]. Taking a 
more systematic approach, Wienchef al. recently revealed 
activity-dependent methylation in a group of distal 
enhancers in the mouse [25], and Bock et a/. [26] demon- 
strated correlations between enhancer methylation and 
expression of particular developmental genes during 
mouse tissue differentiation. 

We conducted a methodical analysis of distal DNA 
methylation sites associated with gene expression in nor- 
mal and cancerous human cells. The results suggest that 
enhancer methylation corresponds closely with expres- 
sion profiles of cancer genes in transformed cells. 

Results and discussion 

To explore relationships between DNA methylation and 
gene expression levels across the genome, we took the fol- 
lowing strategy. First, we developed a model for the rela- 
tionships between methylation in gene promoters and 
gene expression using a machine-learning algorithm. 
Then, we applied the derived model to methylation sites 
away from gene promoters and characterized the chroma- 
tin structure and binding profile of the discovered sites. 
Finally, we explored how the identified sites were altered 
in cancer. 

We analyzed available DNA methylation and gene 
expression data for 58 human cell types (Table SI in Addi- 
tional file 1). The methylation data were produced by two 
different assays: reduced representation bisulfite sequen- 
cing (RRBS) andlnfinium HumanMethylation450 Bead- 
Chip(Illuminainc, San Diego, CA, USA). Following 
verification of high agreement between the assays (Figures 
S1A and S2A, Bin Additional file 1), we combined the two 
datasets. We then analyzed variation in methylation levels 
across the panel of cell types (see Materials and methods 
for detailed description). Sites with methylation levels that 
did not change across the cell types were eliminated, and 
the remaining 670, 906 variable methylation sites (VMSs) 
were used in the study (Figure SIBin Additional file 1). 
VMSs in gene promoterswere negatively correlated with 
expression whereas VMSs in gene bodies were positively 
correlated with expression (Figure S2 in Additional file 1), 
as previously reported [6] . The connection with expression 
levels was most evident at -500 to +2, 000 base pairs rela- 
tive to transcription start sites (TSSs) (Figure S2A-Cin 
Additional file 1). Therefore, we used the CpG methyla- 
tion sites in this range for model development. 

Promoter-based model of regulatory methylation sites 

As a first step in our study, we tested whether we could 
couple promoters with the appropriate genes based 



solely on DNA methylation and gene expression data. 
We identified 15, 205 VMSs in promoters of 3,434 
genes. These gene-CpG pairs provided the 'true' methy- 
lation-expression sample inputs (Figure 1A). We then 
trained a machine-learning algorithm to discriminateth- 
ese true gene-CpG pairs out of an excess offalse (rando- 
mized)pairs, produced by matching promoter CpGs with 
genes from other chromosomes. Through rounds of 
training and test sets, the algorithm optimizes para- 
meters of linear and monotonic correlations between 
methylation and expression, allowing the best discrimi- 
nation of the true pairs (Figure IB). The output was a 
general model for the methylation-expression relation- 
ship in gene promoters (including promoters with or 
without CpG islands). Based on the learned model, we 
further produced a score for each of the gene-CpG 
pairs. At score >0.85, the model successfully paired 
87.2% ofthe genes to their actual promoters, compared 
with a null expectation of 50% (Figure 1C). The sensitiv- 
ity of the test under these conditions was 2.63% (that is, 
predictions were made for 2.63% of the promoter 
VMSs), and the false discovery rate was 12.8%. Thus, 
the developed model successfully paired promoter 
methylation sites with their genes based on methylation 
and expression data alone. 

Relating genes to distal regulatory sitesusing DNA 
methylation 

We then extended the analysis to the VMSs residing 
from one megabases (Mb) upstream of the TSSs through 
one Mb downstream of the transcription end sites of 
17,862 human genes (Figure 2A). We chose to exclude 
more distant sites to decrease complexity and false posi- 
tives, and becauseit was previously shownthat distal regu- 
latory elements tend to concentrate in the global 
vicinities (usually within a few hundredkilobases(kb))of 
the targeted genes [27,28]. Out of 14,702,075 analyzed 
CpG-gene pairs, 2,824 pairs obtainedhigh scores (score 
>0.9) according to the model. The distribution of methy- 
lation-expression relationships of the high-scoring sites 
over the cell types is shown in Figure 2B. High-scoring 
sites were significantly more frequent (P <10~ ) in actual 
gene intervals than in random intervals (produced by 
shuffling the expression data of the actual gene with 
expression data of foreign genes taken from other chro- 
mosomes). The entire list of CpG-gene pairs at score 
>0.75 is given in Additional file 2. 

We then sorted out the sites within 5 kb of promoters or 
alternative promoters of the associated genes. The remain- 
ing 1,911 pairs included 486 unique genes corresponding 
with 1,041 unique distalmethylation regions (we arbitrarily 
clustered CpGs with <3 kb between them into a single 
region). On average, we obtained 2.14distal regions and 
3.93 methylation sites per associated gene. The following 
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Figure 1 Promoter-based model of methylation-expression relationships at regulatory sites (A) A typical gene-CpG sample pair, 
conveying the methylation of a promoter-based CpGsite in relation to the expression levels of its linked gene across cell types.(B) A machine- 
learning algorithm (SVM-MAP) was trained to distinguish true gene-CpG pairs out of 50-fold excess of false (randomized) pairs. Through rounds 
of training and test sets, the algorithm optimized parameters of linear (Pearson coefficient) and monotonic (Spearman) correlations to provide 
the best discrimination between true and false pairs, producing a general model for methylation-transcription relationships in gene promoters. 
Based on fitting with the learned modefa score was assigned to each gene-CpG pair. (C) Rates of successful gene-promoter pairing as a 
function of thresholds on the scores (null expectation = 50%). At score >0.85, the model successfully paired 87.2% ofthe genes to their actual 
promoters (dashed lines). 



analyses were done on these 1,911 high-scoring CpG-gene 
pairs. 

To confirm that the high-scoring methylation sites 
wereindeed regulatory sites, we analyzed their chromatin 
signatures. High-scoring sites were enriched by histone 



modifications that have been previously associated with 
transcription activation, including histone 3 lysine 4 
methylations, lysine 27 acetylation, lysine 9 acetylation 
and histone variant H2A.Z, but not with CTCF binding 
(Figure 2C). Further analysis of chromatin states defined 
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Figure 2 Mapping and validating distal regulatory sites using DNA methylation. (A)Mapping strategy: a model for methylation-expression 
relationships in gene promoters was applied to VMSs from 1 Mb upstream through 1 Mb downstream of 17,862 genes. (B)Distribution of 
methylation-versus-expression levels for high-scoring gene-CpG pairs (score >0.9, n = 2,824). (C)Relative enrichment of chromatin factors around 
the high-scoring methylation sites (n = 1,911), excluding sites in the promoters of the associated genes. Data were normalized to 0 to 1 scale.(D) 
Fold enrichment of methylation sites {n = 2,824) in actual gene intervals (real), versus the null expectation based on random permutations of 
gene expression data (random), of chromatin states defined by the ChromHMM algorithm [27].(E)Left: Number of transcription factors binding to 
the high-scoring sites, compared with random expectations. RightNumber of transcription factors binding to unmethylated or methylated 
enhancers, compared with random expectations. Averages of four cell types for which methylation and binding data were available (GM12878, 
HepG2, HeLaS3, K562) are shown.Sites in the promoters of the associated genes were excluded.(F) Evolutionary sequence conservation around 
the top-scoring sites. The analyses shown in D and E excluded all sites at ±5 kb from TSSs. TF, transcription factors. 



by the ChromHMM algorithm (based on various chro- 
matin factors [27]) revealed significant enrichment of 
the high-scoring sites within promoters and within a 
particular class (ChromHMM state 4) of strong enhan- 
cers (P = 2.4e~ 15 ; Figure 2D). In agreement with this, 
VMSs were hypomethylated when included in enhancer 
chromatin, compared to the methylation of the same 
sites in cells where they were included in non-regulatory 
chromatin (Figure S3 in Additional file 1). In addition, 
the highest frequency of hypomethylated sites was 
observed in state 4 enhancers (Figure S3Cin Additional 
file 1). We also analyzed the protein-binding capacities 
of the identified distal sites. The mapped methylation 
sites bind a larger number of transcription factors than 
expected. Moreover, unmethylated sites bind more fac- 
tors than methylated ones (Figure 2E). Finally, we found 
that the high-scoring methylation sites are evolutionarily 



conserved (Figure 2F). Together, these findings suggest 
that distal expression-related methylation sites populate 
a particular class of transcriptional enhancers, which 
bind transcription factors in a methylation-depended 
manner. 

Validation of enhancer-promoter pairing 

To validate distal enhancer-promoter interactions, we 
compared our predictions with a recent comprehensive 
study of long-range DNA interactions, measured bythe 
chromosome conformation capture carbon copy (5C) 
technique in three cell types [29]. We identified enhan- 
cer sites (n = 318)assessed by both methylation score 
and 5Cprobes, and determined whether the methyla- 
tion-based gene-enhancer pairs also reveal physical 
DNA interactions. Between-assay agreements were 
markedly increased with methylation scores:at scores 
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>0.85, 5C interactions were found in 53% of the cases, 
compared with 29% of lower score interactions (P = 
0.021). A representative example of a long-range enhan- 
cer-promoter interaction captured by the both assay is 
shown in Figure S4 in Additional file 1. We also ana- 
lyzed the frequency of DNA interactions in 21 enhan- 
cer-TSS pairs predicted by our assay, versus the 
interactions of the same enhancers with the other 264 
TSSs in the regions (10 kb to 1 Mb from the enhancer 
site). The frequency of 5C interactions was significantly 
higher with the TSS predicted by our assay than with 
the other TSSs (308.5 ±738.2 junction sequence reads in 
the high-scoring pairs, versus 90.7 ±277.7 reads in the 
other pairs, P = 0.009). Thus, thelarge-scale looping ana- 
lysis affirms the fidelity of enhancer-promoter pairings 
based on DNA methylation signals. Given that our 
study examined many more cell types than the three 
analyzed in the 5C study, and thus could identify addi- 
tional enhancers not active in these three cell types, this 
level of between-assay agreement is probably an under- 
estimation of the actual degree of valid pairs. 

Since certain epigenetic profiles may not be fully pre- 
served in cell lines [6], we also confirmed the ability to 
map enhancer-promoter interactions in freshlyobtained 
(uncultured) tissue samples. Analysis of seven tissue 
biopsies from two donors [30]([GSE:30654]) revealed 
significant intersection (P <0.01) with the enhancer-gene 
pairs observed in the original set of 58 cell types (see 
Materials and methods for detailed description). Thus, a 
significant number of the gene-enhancer interactions 
predicted in cell cultures were replicated in uncultured 
tissue samples. This analysis also indicates a significant 
degree of overlapping between gene-enhancer pairing in 
diverse tissue collections. 

Enhancer methylation defines gradients of cell-type 
transcription levels from permissive promoters 

Interestingly, many of the promoters for which we identi- 
fied high-scoring enhancers were constantly unmethy- 
lated across all cell types, regardless of expression levels 
(Figure 3A,B and Figures S4 and S5 in Additional file 1). 
This finding does not contradict the bulk of evidence 
suggesting that methylation sites in and around promo- 
ters tend to correlate with expression levels (our model 
was in fact built based on this general phenomena). 
Instead, it suggests that in cases of permissive - but not 
necessarily active - promoters, enhancer methylation may 
serve as a main determinant of gene transcription levels. 
Moreover, enhancers tend to show gradual, rather than 
distinct, methylation levels across cell types, in tight cor- 
relation with gradients of expression (note that our 
model does not enforce these gradients). Enhancers have 
previously been connected with cell-type-specific 



transcription patterns. However, it was generally assumed 
that sets of enhancers, each of them active in a particular 
tissue or tissues, are needed [31]. According to this 
model, we would expect that a given enhancer site would 
be fully methylated in most of the cell types, and fully 
unmethylated in some. The tendency of the identified 
sites to show gradual methylation differences between 
cell types supports an alternative model, in which even a 
single enhancer site can mediate distinct transcrip- 
tion levels over wide ranges of tissues and cell types 
(Figure 3C). 

Altered enhancer methylation predicts changes in the 
expression profiles of cancer genes 

Altered histone modifications have recently been demon- 
strated in enhancers in cancer cells [32]. We therefore 
explored whether methylation of enhancer DNA is also 
altered in cancers. We analyzed the methylation levels of 
the 1,911 distal methylation sites associated with the 
expression of 486 genes in normal human mammary 
epithelial cells versus mammary glandadenocarcinoma 
(MCF7) cells. A subset of the predicted enhancers was 
hypo- or hypermethylated in the cancer cells compared 
to the normal cells. The genes associated with hypo- 
methylated enhancers tended to be upregulated in the 
cancer, and the genes associated with the hypermethy- 
lated enhancers tended to be downregulated (Figure 4A, 
upper row). Moreover, altered gene expression was better 
correlated with altered enhancer methylation than with 
altered promoter methylation (R = -0.37 versus R = -0.16, 
respectively), and the effect of methylation differences on 
gene expression was significantly higher in enhancer sites 
than in promoter sites (P = 3.55e- ). Thus, enhancer 
methylation is remarkably altered in cancer and, more- 
over, enhancer methylation is more closely related to 
changes in gene expression than promoter methylation. 

To examine whether these patterns are exclusive to 
mammary cells, we repeated the analysis in lung epithe- 
lia. The analysis of normal human bronchial epithelial 
cells versus epithelial lung carcinoma (A549) cells 
revealed similar relationships between enhancer and pro- 
moter methylation and gene expression (Figure 4A, mid- 
dle row). We have further analyzed a set of 18 normal 
cell types of various tissue origins versus 18 cancers of 
various tissue origins (Table S2 in Additional file 1). 
Once again, the methylation of enhancers was drastically 
altered and was tightly connected with alteration of gene 
expression, whereas promoters showed smaller altera- 
tions and weaker correlation with expression (R = -0.51 
versus R = -0.24); P <le" ; Figure 4A, lower row). Thus, 
alteration of enhancer methylation is common to many 
cancer types, and is associated with substantial changes 
in gene expression. 
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Figure 4 Altered enhancer methylation predicts changes in the expression profiles of cancer genes (A) Each of the x-y scatters shows 
the difference in gene expression between normal and cancer cells as a function of the difference in methylation levels (for example, a 
difference of +100 indicates that the given site was 0% methylated in the normal cells and 100% methylated in the cancer). The plots on the 
left are for high-scoring enhancer sites associate with 486 genes.the right plots show the promoters of 394(out of 486) genes that were 
associated with enhancers, and theplots in the middle show all promoters.Blue dots and clouds donate the frequencies of methylation-versus- 
expression differences. Linear regression (red lines) and Pearson coefficients (R values) are shown for each scatter. (B) Same as in A, but for the 
505methylation sites (out of the 1,911) in state 4 enhancers. (C)Left:Overlap between the genes (score >0.85) in the upper-left quadrants (that is, 
genes thatwere upregulated by >0.25 expression units withdistal enhancers that were hypomethylated by >25%) in breast, lung, and the 
collection of 18 normal versus cancer cell types. The numbers of overlapping genes are indicated. Right: Examples of GO groups that 
significantly enriched among the 207 genes that were upregulated and hypomethylated in the various cancer types (the entire list is provided in 
Table S4 in Additional file 1). 
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Since our enhancers were specifically enriched in the 
chromatin class defined as ChromHMM state 4 (Figure 2C), 
and since enhancers in this state showed the highest methy- 
lation differences compared to non-regulatory chromatin 
(Figure S3 in Additional file 1), we hypothesized that state-4 
enhancers would show the closest association with changes 
in gene expression in cancers. Indeed, this group of enhan- 
cers showed an even tighter correlation with gene expres- 
sion levels compared with the entire group of enhancers 
(R = -0.58 versus R = -0.51; Figure 4B). 

We also sought to analyze what characteristics the 
altered genes might share in addition to association with 
enhancer methylation. For this, we first analyzed whether 
the same genes tended to be upregulated or downregu- 
lated across cancer types, and then asked whether they 
belonged to defined gene-ontology (GO) categories. As 
shown in Figure 4, a majority of the genes were hyper- 
methylated and downregulated, and a smaller fraction of 
the genes showed the opposite pattern. We found no sig- 
nificant overlap between the genes that were hyper- 
methylated and downregulated in mammary, lung or in 
the sets of various cell types. These genes also showed no 
significant GO clustering. Thus, the hypermethylated 
genes might be cancer-type-specific (for example, 
cell-type-specific genes that are downregulated in the 
derived cancer). By contrast, the hypomethylated,upregu- 
lated genes significantly repeated between cancer types 
(Figure 4C, left), and 58 of them appeared in all cancer 
types examined (P < e , compared with the null expec- 
tation of 2.3 overlapping genes; Table S3 in Additional 
file 1). GO analysis showed that the hypomethylated and 
upregulated genes were frequently involved in cell grow- 
ing-related functions (Figure 4C right; an extended list of 
significant GO clusters is given in Table S4 in Additional 
file 1). Thus, the hypomethylated enhancers associate 
with genes that are active in many cancer types and sup- 
port cell proliferation. 

An interesting question raised from the above is 
whether enhancer methylation may also distinguish 
between cancer and normally- dividing cells. To examine 
this, we compared Epstein-Barr virus-immortalizedlym- 
phoblastoid(GM12878)cells, which are rapidly growing in 
culture but otherwise have normal karyotype and no par- 
ticular characteristics of cancer cells, with lymphoblastoid 
(Jukat) cells derived from an acute leukemia. We identi- 
fied 74 genes that were hypomethylated and upregulated 
in the leukemia compared to the normal lymphoblas- 
toids. As expected, these were not proliferative genes - 
the proliferative genes were already active in the dividing 
normal cells, so they were not expected to show differ- 
ences compared to the cancer. Instead, this analysis cap- 
tured genes that were involved in cancer-related processes 
other than cellproliferation, such as the PRAME gene, 
which is believed to inhibit myeloid differentiation in 



certain myeloid leukemias[33] (Table S5 in Additional 
file 1). Of the 74 upregulated genes, 31 were also hypo- 
methylated and upregulated in the sets of normal versus 
cancer cell types (Figure 4), a significantly higher number 
than expected by chance (P = 7e ). Thus, enhancer hypo- 
methylation may associate with upregulation of genes 
involved in a variety of cancer-related pathways. 

Enhancer methylation maybe specifically altered in cancer 

Finally, we asked whether the observed perturbations of 
enhancer methylation in cancers (Figure 4) are fully 
explained by a global perturbation of the DNA methyla- 
tion blueprint in cancer cells, or are the outcome of 
enhancer-specific processes. For this, we first analyzed the 
overall differences between the methylation of normal and 
cancer cells. As previously reported, methylation levels 
along the genomes of normal cells from a given tissue (left 
scatter in Figure 5A), or even from different tissues (left 
scatter in Figure 5B), generally resemble each other. Can- 
cer cells, however, are grossly differentiated from normal 
cells: sites that are normally unmethylated tended to gain 
methylation, sites that were normally methylated tended 
to lose methylation, and partially-methylated regions 
shifted to both directions, accumulating at the extremities 
of the methylation scale (since larger and smaller shifts 
must converge at the borders). Due to this effect, cancers 
are overall more similar to each other than to the normal 
tissues (Figure 5B). We also analyzed the overall shift in 
the methylation of particular genomic sections. As 
expected, promoters, which are generally unmethylated in 
normal cells, tended to gain methylation in the cancer 
samples (Figure 5C), while the rest of the genome was 
slightly hypomethylated (Figure 5D). Surprisingly, high- 
scoring enhancer sites demonstrated the most dramatic 
shift between normal and cancer cells: they were partly 
methylated in the normal cells, and become highly methy- 
lated in the cancer (Figure 5E). 

With these baseline analyses in hand, we then asked 
whether any particular group of methylation sites 
behaved as expected from the global trends, or stand 
alone. Figure 5F,G shows the average alteration of a 
given genomic site in cancer, as a function of its methy- 
lation level in the normal cells. For example, if a given 
site was 10% methylated in the normal cells, its average 
methylation shift in the cancer, based on the analyses 
shown in Figure 5A,B, was about plus 30% methylation. 
We found that promoters (green lines in Figure 5E,G) 
were behaving more or less as expected from the global 
trend of the genome (black lines). In other words, the 
alteration of promoter methylation in cancer was similar 
to the alteration of non-promoter sites with similar nor- 
mal methylation levels. Looking more carefully into the 
promoter classes, it appeared that active promoters 
tended to maintain low methylation against the global 
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Figure 5 Enhancer methylation maybe specifically altered in cancer (A,B)x-y scatters showing the methylation of sites across the genome 
in a given cell type (or a collection of cell types) versus another cell type (or types).(A)Normal and cancerous lung epithelia: genome-wide 
methylation of normal versus normal (left) or of cancer versus normal (middle and right) cell types. The given cell types are indicated, as listed in 
table S2. (B) Normal and cancerous cell types of various tissue origins: genome-wide methylation of nine normal cell types of various tissue 
origins, versus another nine cell types (left), of 18 cancer versus 18 normal cell types (middle), or of nine cancers versus other cancers (right). (C- 
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mammary epithelial cells and MCF7 cells shown in A. (G)Same analysis as in F, but for 18 normal and 18 cancer cell types shown in B. 
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trend, whereas poised (polycomb-repressed) promoters 
gained more methylation than expected from the global 
trend. Yet, the overall alteration of promoter methylation 
in cancer was relatively close to that expected from the 
global trend. The shift in the methylationof the high- 
scoring enhancers, however, was clearly beyond what 
could be expected from the global trend (Figure 5F,G). 
Thus, the alteration of the high-scoring enhancer sites in 
cancer is probably not simply the outcome of the global 
loss of methylation control in cancer cells. Rather, these 
sites may be under the effect of a more specific mechan- 
ism, such as targeted methylation or demethylation and/ 
or cell selection. 

Conclusions 

The mapping of associationsbetween distal regulatory 
sites and the genes they control is a challenging task, 
which only recently began to be confronted on the gen- 
ome-wide scale. Attempts to predict gene-enhancer pairs 
were based on the profiling of chromatin states or tran- 
scription factor binding [27,34], or of long-range DNA 
looping [28,29] .Here, we show that enhancers can be also 
associated with genes using DNA methylation. In con- 
trast to the above mapping approaches, methylation data 
are readily available and are highly quantitative, and thus 
may enhance mapping of gene-enhancer pairing. 

We found that distal expression-related methylation sites 
are abundant in the human genome, co-localizing with 
enhancer chromatin marks, and are more predictive of 
expression levels then promoter methylations. While not 
all distal regulatory sites in the genome must exhibited pro- 
moter-like methylation, we showed that a large number of 
enhancer sites demonstrate reverse correlation between 
methylation and expression, as in gene promoters. 

We have further shown that hypomethylation state is 
directly related to enhancer activity across cell types 
(Figure S3 in Additional file 1). The observation that 
hypomethylated enhancers bind more transcription fac- 
tors than methylated ones (Figure 2E) suggests a possi- 
ble mechanism underlying the connection between 
DNA methylation and enhancer activity. Consistent with 
this possibility, high-scoring enhancers are particularly 
enriched within a defined chromatin state (Figures 2D), 
which is particularly hypomethylated compared to non- 
regulatory chromatin (Figure S3Cin Additional file 1). 
Whether this chromatin state holds particularly active 
enhancers, or perhaps a unique class of methylation- 
related enhancers, remains to be elucidated. 

The range of cell types we analyzed in this study was 
determined by the availability of methylation and expres- 
sion data. In addition, the RRBS and Infinium Human- 
Methylation450 BeadChip methylation data we used 
provide limited genomic coverage and are strongly biased 
towards promoters and certain other portions of the 



genome, while enhancers are not efficiently targeted. 
Because of this, it is likely that more complete methylo- 
mic coverage will expose many additional enhancer-gene 
pairs. Whole-genome bisulfite sequencing approaches 
have recently become popular and whole methylome 
analyses of human tissues are rapidly accumulating. 
Utilizing our approach, these additional data should 
allow the production of more comprehensive maps of 
enhancer-gene pairing across tissues, cell types and 
conditions. 

Unmethylated promoters are permissive for, but do not 
necessarily determine, transcription initiation. We 
showed that enhancer methylation associates with cell- 
type-specific expression levels, even when the promoter 
is constantly unmethylated (Figure 3). Moreover, enhan- 
cer methylation characterizes small (and larger) expres- 
sion differences. Thus, enhancers are not just on-off 
switches of cell-type transcription levels, as previously 
suggested, but may also mediate ranges of expression 
levels across multiple cell types. In contrast to the tradi- 
tional model of one enhancer site per cell type, we sug- 
gest that a gradient of methylation states at a single 
enhancer site may direct distinct expression levels in 
many different cell types (Figure 3C). 

In occasional examples, enhancer methylation level has 
been suggested to be associated with the control of can- 
cer-related genes [35-37]. However, to our knowledge this 
is the first report on a global association between per- 
turbed enhancer methylation and aberrant expression of 
cancer genes. We have shown that hypomethylated enhan- 
cers associated with the upregulation of many cancer 
genes controlling various cellular functions (Figure 4C), 
some of them involved in cell proliferation and some in 
other cancer-related processes. Moreover, many of these 
hypomethylated genes were found in most cancer types 
examined, suggesting a pan-cancer mechanism. However, 
the larger group of hypermethylated enhancers seemed to 
target cancer-type-specific genes. Given the limited geno- 
mic coverage of this study, many additional cancer-related 
enhancers are expected in the genome. 

To date, almost all studies of cancer-related methyla- 
tion have focused on gene promoters and CpG islands. 
Among these, the predominant event in cancers is hyper- 
methylation of polycomb-repressed promoters [9-11]. 
This hypermethylation does not directly affect expression 
levels, as the associated genes are inactive in the normal 
tissue and generally remain inactive in the cancer 
(although it may limit the potential for re-activation of 
silenced genes in the cancer). Here, we established a very 
different occurrence in the other large group of regula- 
tory sites - the transcriptional enhancers. These sites are 
drastically altered in cancers, to both hypo- and hyper- 
methylation, and are closely related to substantial modifi- 
cations in the expression levels of cancer-related genes 
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(Figure 4). Moreover, their aberrant methylation in can- 
cers might derive from targeted methylation or demethy- 
lation or from selection of the altered cells (Figure 5). 
Whether targeted or selected, aberrant enhancer methy- 
lation may be involved in important events during cancer 
development. 

We have uncovered a class of distal methylation sites 
that closely describe cell-type transcription levels. These 
sites reside in a particular subclass of transcriptional 
enhancers and are associated with cell-type-specific 
enhancer activity, possibly through communication with 
the binding of transcription factors. Methylation levels 
of these enhancers associate with gradual expression dif- 
ferences across cell types, even when the linked promo- 
ters are consistently unmethylated across the cell types. 
The radical changes in methylation of these sites in can- 
cer is beyond that expected from the general profile of 
the cancer methylome, and may reflect specific targeting 
of the methylation and demethylation machinery to 
these sites, and/or functional contribution to tumor 
development. Further analyses of these sites may provide 
crucial information about paradigms of gene expression 
control in normal and cancerous cells. 

Materials and methods 

Imported datasets 

The following datasets were downloaded from the 
ENCODE website [38]: 

1. DNA methylation data for 58 cells types, for which 
expression data were also available. Methylation data 
were the average of two RRBS experiments (52 of the 
cell types, as detailed in Table SI in Additional file 1) or 
Infinium HumanMethylation450 BeadChip data (36 of 
the RRBS cell types plus additional six cell types for 
which only BeadChip data were available). 

2. Expression data for the above cell types produced by 
the Affymetrix Human Exon array. Gene-level expression 
data for 17,862 genes was extracted from the raw data 
following Robust Multi-chip Average normalization. 

3. Histone chromatin immunoprecipitation (ChlP)- 
sequencing peaks and ChromHMM annotation data avail- 
able for six of the above cell types, including K562, human 
mammary epithelial cells, human skeletal muscle myo- 
blasts, normal human lung fibroblasts, GM12878 and 
HepG2 cells. 

4. ChlP-sequencing peaks data for 122 DNA binding 
factors in various cell types. 

Human whole-genome methylation data were from 
Lister et al. [39]. Genomic locations are according to 
human genome version hgl9 of the human genome. 
Gene sizes, start and end sites, promoters and alterna- 
tive promoters are according to the RefSeq database 
(National Center for Biotechnology Information, 
Bethesda, MD, USA). Methylation and expression data 



for somatic tissues of two individuals are from [30] 
([GSE:30654] patients #1 and 2). 

Data filtering and categorization 

VMSs were CpGs with standard deviation >5 after omit- 
ting outliers utilizing the Thomson's Tau method (alpha = 
0.1%). A 'sample' (gene-CpG pair) was the two-dimen- 
sional matrix describing the methylation levels of a given 
CpG site as a function of the expression levels of a given 
gene across the cell types. For the RRBS data,only methy- 
lation sites that were sequenced at least lOtimes in at least 
60% (80% for the Support Vector Machine (SVM) algo- 
rithm) of the cell types were included. 

Machine learning 

The following fourfeatures were extracted for each CpG- 
gene sample: positive Pearson coefficient (0 if negative); 
negative Pearson coefficient (0 if positive); positive Spear- 
man coefficient (0 if negative); and negative Spearman 
coefficient (0 if positive). Based on these features, we 
learned a model distinguishing false from true samples, 
applying a machine-learning algorithm. The SVMmap 
algorithm is a SVM algorithm for predicting rankings. It 
performs supervised learning using binary labeled train- 
ing examples, with the goal of optimizing mean average 
precision (MAP) [40]. We used SVM-MAP for ranking 
the CpG-gene samples in 3,434 gene queries (RRBS data 
in 52 cell types) containing 15,205 true samples and a 50- 
fold excess of randomly selected false samples, where the 
goal wasfitting a model that ranks the true samples above 
the false samples. The regularization parameter was opti- 
mized by dividing the gene queries into training and test 
sets (4:1 ratio) and performing 10-fold cross-validation. 
Query P -values (the probability of obtaining the top 
ranked true sample by chance) were calculated for the 
test set queries as follows: 

We denoteX = min T (rank), where T represents the 
true samples. The event that X = x is equivalent to 
selecting |T| — 1 samples from {x + l,x + 2, . . . ,n}. 
Therefore the probability of this event can be obtained 
by the hypergeometric distribution: 



Pr (X = x) 



n — x 
\T\-1 



And the probability that X < minx (rank) is: 



Pr(X<x) = J2 



n — i 
\T\- 1 

n 
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The resulting model for the methylation-expression 
relationship consisted of the following learned features: 
positive Pearson coefficient = -0.0058, negative Pearson 
coefficient = 0.0104, positive Spearman coefficient = 
-0.0038, and negative Spearman coefficient = 0.0088. 

Applying the learned model to the datasets 

The model learned in gene promoters was applied to all 
CpGs in 17,862 gene intervals (1 Mb upstream, gene- 
body, 1 Mb downstream to each gene), resulting in 
14,702,075 CpG-gene pairs. The scores were normalized 
to a 0 to 1 range(c e C a CpG site, jeG a gene): 

. (model, features(c, g)) — min(s(C,G)) 

Score (eg) = 

v 67 max (s (C, G)) - mm (s (C, G)) 

Data analyses 

Frequency map of high-scoring sites 

The distribution of methylation-expression relationships of 
the high-scoring sites over the cell types is a contour repre- 
sentation of the model learned to distinguish true from 
random pairs. The frequency map shown in Figure 2B is a 
representation of the corresponding scatter plot (SVM- 
MAP for 25 x 25 grids of high-scoring pairs), with smooth- 
ing of each point around a sphere (radius = 3). 
Enrichment of chromatin marks around the high-scoring 
sites 

ChlP-seq data ('modification score') were downloaded 
from the ENCODE website, averaged across the analyzed 
cell types, normalized to a 0 to 1 scale, and averaged 
across the high-scoring methylation sites. Loess smoothing 
with a span of 10% was applied to the presented data. 
Enrichment of chromatin states around the high-scoring 
sites 

ChromHMM states were identified for the CpG sites 
across the cell types. (A state was called when a given site 
was found in a given state in at least one of the cell types. 
Sites may be related to more than one state). The actual 
number of sites in a given chromatin category and a 
given score level ('real'), was divided by the number 
obtained from shuffling the expression data between 
genes 10 times ('random'). P -values were calculated 
based on the normal distribution of the shuffled data. 
Binding of transcription factors to the high-scoring sites 
ChlP-sequencing peaks data for 122 DNA binding factors 
in fourcell types(GM12878, HepG2, HeLaS3, K562 were 
downloaded. For every high-scoring site we counted the 
number of factors that bind, and averaged across the cell 
types. Random expectation and real versus random P- 
values were obtained as above. The P -value for the dif- 
ference between methylated and unmethylated sites was 
calculated using the Wilcoxon rank-sum test for differ- 
ence between averages. 



Conserved sequences around the high-scoring sites 

PhastConsdata (phastCons46wayPrimates and phast- 
Cons46wayPlacental) were downloaded from the Univer- 
sity of California, Santa Cruz genome browser and 
plotted around the high-scoring sites. 
Agreement with gene-enhancer pairing by long-range 
chromatin interactions 

From 5C data produced at the University of Massachu- 
setts in threecell lines (Gml2878, HeLA-S3, K562) we 
sorted out the enhancer sitesthat were probed by the 5C, 
for which we also have at least one CpG site available to 
our analysis at±500 base pairs from the probed enhancer. 
We then located the TSS that obtained the highest score 
in relation to the enhancer CpG. This yielded 318 inter- 
actions, at distances of 10 kb to 1 Mb between the CpG 
side and the TSS side. We then counted the number of 
cases in which the maximum number of interaction 
sequence reads (average of two repeats over three cell 
types) washigher than the average interaction read num- 
ber (98.23 reads). P -values were obtained by comparing 
the binomial probability of the agreement in high versus 
low scoringsites. We also compared the number of 5C 
interaction reads inhigh-scoring (score>0.85) enhancer- 
TSS pairs, versusthe number interactions between these 
enhancers and other TSSs at 10 kbto 1 Mb from the 
enhancer sites. The P -value of the difference was calcu- 
lated by the Wilcoxon Rank- Sum test. 
Replication in uncultured samples 

Methylation and expression data for bladder, lymph 
node, ureter, lung, stomach, skeletal muscles and adipose 
tissue biopsies of twoindividuals were downloaded from a 
recently-published dataset [GSE:30654]. Out of the 
enhancer-promoter interactions that were predicted in 
the original set of the 58 cell types, 876 were also assessa- 
ble in the new dataset (that is, methylation data were 
available for the relevant CpG sites, and expression data 
were available for the interacting genes). We re-scored 
these gene-CpG pairs, using our model, in the new data- 
set. If the interactions that were predicted in the 58 cell 
types were not relevant to interactions in the uncultured 
tissue samples, than we expected that 20.5 ±4.5out of the 
876 would (by chance) obtain high-scoring (score>0.9) 
interactionsin individual #1, and 19.9 ±3.5in individual 
#2. Instead, we observed 29 in individual #1 {P = 0.0145) 
and 28 (P = 0.0084) in individual #2. The P-value (normal 
distribution, two-sided) indicated in the result and dis- 
cussion section is for the average of both individuals 
(expected = 20.21±3.206, observed = 28.5 interactions, 
P = 0.0098). 
Alteration in cancer 

Differences betweennormal and cancer cells were mea- 
sured by Robust Multi-chip Average expression units 
and by methylation percentages. The distributions of 
methylation and expression differences between normal 
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and cancer samples (Figure 4) were smoothed (color 
density representation) using a kernel density estimate 
(transformation function = x A 3). Linear regressions were 
performed (red lines and Pearson coefficients in Figure 
4). The P-value of the frequencies of signals in the 
upper-left and lower-right quarters, in enhancers versus 
promoters, were calculated using normal distribution of 
two-proportion z test. P -valuesof the overlap between 
genes in the cancer types were calculated based on the 
expectation from random intersections between the 
groups. 

Gene ontology analysis 

Analysis of GO terms of thehypomethylatedupregulated 
genes was done with the GOrilla tool [41], using the list 
of high-scoring genes as a background. 

Additional material 
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